A New Statistical Parser Based on Bigram Lexical Dependencies 
Abstract

This paper describes a new statistical parser which is based on
probabilities of dependencies between head-words in the parse tree.
Standard bigram probability estimation techniques are extended to
calculate probabilities of dependencies between pairs of words. Tests
using Wall Street Journal data show that the method performs at least
as well as SPATTER (Magerman 95; Jelinek et al. 94), which has the best
published results for a statistical parser on this task. The simplicity
of the approach means the model trains on 40,000 sentences in under 15
minutes. With a beam search strategy parsing speed can be improved to
over 200 sentences a minute with negligible loss in accuracy.

1 Introduction 

Lexical information has been shown to be crucial for 
many parsing decisions, such as prepositional-phrase 
attachment (for example (Hindle and Rooth 93)). 
However, early approaches to probabilistic parsing 
(Pereira and Schabes 92; Magerman and Marcus 91; 
Briscoe and Carroll 93) conditioned probabilities on 
non-terminal labels and part of speech tags alone. 
The SPATTER parser (Magerman 95; Jelinek et al. 
94) does use lexical information, and recovers labeled 
constituents in Wall Street Journal text with above 
84% accuracy - as far as we know the best published 
results on this task. 

This paper describes a new parser which is much simpler than SPATTER,
yet performs at least as well when trained and tested on the same Wall
Street Journal data. The method uses lexical information directly by
modeling head-modifier¹ relations between pairs of words. In this way
it is similar to Link grammars (Lafferty et al. 92), and dependency
grammars in general.

2 The Statistical Model 

The aim of a parser is to take a tagged sentence as input (for example
Figure 1(a)) and produce a phrase-structure tree as output (Figure
1(b)). A statistical approach to this problem consists of two
components. First, the statistical model assigns a probability to every
candidate parse tree for a sentence. Formally, given a sentence $S$ and
a tree $T$, the model estimates the conditional probability
$P(T \mid S)$. The most likely parse under the model is then:

$$T_{best} = \arg\max_{T} P(T \mid S) \tag{1}$$

Second, the parser is a method for finding $T_{best}$. This section
describes the statistical model, while Section 3 describes the parser.

The key to the statistical model is that any tree such as Figure 1(b)
can be represented as a set of baseNPs² and a set of dependencies as in
Figure 1(c). We call the set of baseNPs $B$, and the set of
dependencies $D$; Figure 1(d) shows $B$ and $D$ for this example. For
the purposes of our model, $T = (B, D)$, and:

$$P(T \mid S) = P(B, D \mid S) = P(B \mid S) \times P(D \mid S, B) \tag{2}$$

$S$ is the sentence with words tagged for part of speech. That is,
$S = \langle (w_1, t_1), (w_2, t_2) \ldots (w_n, t_n) \rangle$.
For POS tagging we use a maximum-entropy tagger described in
(Ratnaparkhi 96). The tagger performs at around 97% accuracy on Wall
Street Journal text, and is trained on the first 40,000 sentences of
the Penn Treebank (Marcus et al. 93).

Given $S$ and $B$, the reduced sentence $\bar{S}$ is defined as the
subsequence of $S$ which is formed by removing punctuation and reducing
all baseNPs to their head-word alone.



"This research was supported by ARPA Grant 
N6600194-C6043. 

1 By 'modifier' we mean the linguistic notion of either 
an argument or adjunct. 



2 A baseNP or 'minimal' NP is a non-recursive NP,
i.e. none of its child constituents are NPs. The term
was first used in (Ramshaw and Marcus 95).



(a) John/NNP Smith/NNP, the/DT president/NN of/IN IBM/NNP, announced/VBD his/PRP$ resignation/NN yesterday/NN .

(b) [Parse-tree diagram for the sentence in (a); the drawing is not recoverable from the source text]

(c) [John Smith] [the president] of [IBM] announced [his resignation] [yesterday]
    [Dependency arrows and their labels are drawn over this bracketing in the original figure]

(d) B = { [John Smith], [the president], [IBM], [his resignation], [yesterday] }

    D = { Smith -> announced <NP,S,VP>,
          president -> Smith <NP,NP,NP>,
          of -> president <PP,NP,NP>,
          IBM -> of <NP,PP,IN>,
          resignation -> announced <NP,VP,VBD>,
          yesterday -> announced <NP,VP,VBD> }
    (each entry is modifier -> head, labeled <modifier, parent, head-child> as in Section 2.1)

Figure 1: An overview of the representation used by the model. (a) The
tagged sentence; (b) A candidate parse-tree (the correct one); (c) A
dependency representation of (b). Square brackets enclose baseNPs
(heads of baseNPs are marked in bold). Arrows show modifier -> head
dependencies. Section 2.1 describes how arrows are labeled with
non-terminal triples from the parse-tree. Non-head words within
baseNPs are excluded from the dependency structure; (d) B, the set of
baseNPs, and D, the set of dependencies, are extracted from (c).



Thus the reduced sentence is an array of word/tag pairs,
$\bar{S} = \langle (\bar{w}_1, \bar{t}_1), (\bar{w}_2, \bar{t}_2) \ldots (\bar{w}_m, \bar{t}_m) \rangle$,
where $m \le n$. For example, for Figure 1(a):

Example 1 $\bar{S}$ =

< (Smith, NNP), (president, NN), (of, IN),
(IBM, NNP), (announced, VBD),
(resignation, NN), (yesterday, NN) >
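
As an illustration of how the reduced sentence is obtained, the following is a minimal Python sketch. It is not the original implementation: the punctuation tag set `PUNCT_TAGS`, the function name `reduce_sentence`, and the rule that the head of a baseNP is its last word are assumptions made for this example only (the paper uses head rules described in Section 2.1).

```python
# Hypothetical punctuation tag set; the paper only states that punctuation is removed.
PUNCT_TAGS = {",", ":", ".", "``", "''"}

def reduce_sentence(tagged, base_nps):
    """Drop punctuation and collapse each baseNP span [start, end) to its head.

    Assumption of this sketch: the head of a baseNP is its last word.
    """
    inside = set()   # positions covered by some baseNP
    heads = set()    # positions of baseNP head words
    for start, end in base_nps:
        inside.update(range(start, end))
        heads.add(end - 1)
    reduced = []
    for i, (word, tag) in enumerate(tagged):
        if tag in PUNCT_TAGS:
            continue                      # punctuation is not part of the reduced sentence
        if i in inside and i not in heads:
            continue                      # non-head word inside a baseNP
        reduced.append((word, tag))
    return reduced

sentence = [("John", "NNP"), ("Smith", "NNP"), (",", ","), ("the", "DT"),
            ("president", "NN"), ("of", "IN"), ("IBM", "NNP"), (",", ","),
            ("announced", "VBD"), ("his", "PRP$"), ("resignation", "NN"),
            ("yesterday", "NN"), (".", ".")]
base_nps = [(0, 2), (3, 5), (6, 7), (9, 11), (11, 12)]
reduced = reduce_sentence(sentence, base_nps)
```

With these inputs `reduced` contains the seven word/tag pairs of Example 1.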

Sections 2.1 to 2.4 describe the dependency model. 
Section 2.5 then describes the baseNP model, which 
uses bigram tagging techniques similar to (Ramshaw 
and Marcus 95; Church 88). 

2.1 The Mapping from Trees to Sets of 
Dependencies 

The dependency model is limited to relationships between words in
reduced sentences such as Example 1. The mapping from trees to
dependency structures is central to the dependency model. It is
defined in two steps:

1. For each constituent $P \rightarrow \langle C_1 \ldots C_n \rangle$
in the parse tree a simple set of rules³ identifies which of the
children $C_i$ is the 'head-child' of $P$. For example, NN would be
identified as the head-child of NP -> <DET JJ JJ NN>, and VP would be
identified as the head-child of S -> <NP VP>. Head-words propagate up
through the tree, each parent receiving its head-word from its
head-child. For example, in S -> <NP VP>, S gets its head-word,
announced, from its head-child, the VP.

3 The rules are essentially the same as in (Magerman
95; Jelinek et al. 94). These rules are also used to find
the head-word of baseNPs, enabling the mapping from
$S$ and $B$ to $\bar{S}$.



[Figure 2 tree diagram. Node labels recoverable from the source text:
NP(Smith) -> NP(Smith) NP(president); NP(president) -> NP(president)
PP(of); PP(of) -> IN NP; VP -> VBD(announced) NP(resignation)
NP(yesterday)]

Figure 2: Parse tree for the reduced sentence in Example 1. The
head-child of each constituent is shown in bold. The head-word for each
constituent is shown in parentheses.

2. Head-modifier relationships are now extracted from the tree in
Figure 2. Figure 3 illustrates how each constituent contributes a set
of dependency relationships. VBD is identified as the head-child of
VP -> <VBD NP NP>. The head-words of the two NPs, resignation and
yesterday, both modify the head-word of the VBD, announced.
Dependencies are labeled by the modifier non-terminal, NP in both of
these cases, the parent non-terminal, VP, and finally the head-child
non-terminal, VBD. The triple of non-terminals at the start, middle
and end of the arrow specifies the nature of the dependency
relationship - <NP,S,VP> represents a subject-verb dependency,
<PP,NP,NP> denotes prepositional phrase modification of an NP, and so
on⁴.

[Figure 3 diagram: the constituent VP -> <VBD NP NP> with head-child
VBD(announced) and its two NP modifiers]

Figure 3: Each constituent with n children (in this case n = 3)
contributes n - 1 dependencies.

Each word in the reduced sentence, with the exception of the sentential
head 'announced', modifies exactly one other word. We use the notation

$$AF(j) = (h_j, R_j) \tag{3}$$

to state that the $j$th word in the reduced sentence is a modifier to
the $h_j$th word, with relationship $R_j$⁵. AF stands for 'arrow from'.
$R_j$ is the triple of labels at the start, middle and end of the
arrow. For example, $\bar{w}_1$ = Smith in this sentence, and
$\bar{w}_5$ = announced, so AF(1) = (5, <NP,S,VP>).

4 The triple can also be viewed as representing a semantic
predicate-argument relationship, with the three elements being the
type of the argument, result and functor respectively. This is
particularly apparent in Categorial Grammar formalisms (Wood 93),
which make an explicit link between dependencies and functional
application.

5 For the head-word of the entire sentence $h_j = 0$, with
$R_j$ = <Label of the root of the parse tree>. So in this
case, AF(5) = (0, <S>).

$D$ is now defined as the m-tuple of dependencies:
$D = \{AF(1), AF(2) \ldots AF(m)\}$. The model assumes that the
dependencies are independent, so that:

$$P(D \mid S, B) = \prod_{j=1}^{m} P(AF(j) \mid S, B) \tag{4}$$
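
The two-step mapping of this section (head-word propagation followed by extraction of labeled dependencies) can be sketched as below. This is an illustrative sketch only: the tree encoding, the function names, and the tiny `HEAD_CHILD` table are hypothetical stand-ins for the head rules of (Magerman 95), which are considerably richer.

```python
# A node is (label, [children]) for a constituent, or (label, word) for a leaf
# of the reduced sentence. Both encodings are assumptions of this sketch.

# Hypothetical head table: parent label -> label of its head-child.
HEAD_CHILD = {"S": "VP", "VP": "VBD", "NP": "NP", "PP": "IN"}

def head_index(label, children):
    """Position of the first child whose label matches the head-child label."""
    wanted = HEAD_CHILD[label]
    for i, (child_label, _) in enumerate(children):
        if child_label == wanted:
            return i
    raise ValueError("no head-child found for " + label)

def extract(node, deps):
    """Return the head-word of node and append (modifier, head, triple) to deps."""
    label, body = node
    if isinstance(body, str):            # leaf: the word is its own head
        return body
    heads = [extract(child, deps) for child in body]
    h = head_index(label, body)
    for i, child in enumerate(body):
        if i != h:
            # triple = <modifier label, parent label, head-child label>
            deps.append((heads[i], heads[h], (child[0], label, body[h][0])))
    return heads[h]

tree = ("S", [
    ("NP", [("NP", "Smith"),
            ("NP", [("NP", "president"),
                    ("PP", [("IN", "of"), ("NP", "IBM")])])]),
    ("VP", [("VBD", "announced"), ("NP", "resignation"), ("NP", "yesterday")]),
])
deps = []
root_head = extract(tree, deps)
```

For the reduced sentence of Example 1 this yields six dependencies (one per non-head word), with `announced` as the head of the sentence; a constituent with n children contributes n - 1 of them.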



2.2 Calculating Dependency Probabilities 

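This section describes the way $P(AF(j) \mid S, B)$ is estimated. The
same sentence is very unlikely to appear both in training and test
data, so we need to back-off from the entire sentence context. We
believe that lexical information is crucial to attachment decisions, so
it is natural to condition on the words and tags. Let $\mathcal{V}$ be
the vocabulary of all words seen in training data, $\mathcal{T}$ be the
set of all part-of-speech tags, and $\mathcal{TRAIN}$ be the training
set, a set of reduced sentences. We define the following functions:

• $C(\langle a,b \rangle, \langle c,d \rangle)$ for
$a, c \in \mathcal{V}$ and $b, d \in \mathcal{T}$ is the number of
times $\langle a,b \rangle$ and $\langle c,d \rangle$ are seen in the
same reduced sentence in training data.⁶ Formally,

$$C(\langle a,b \rangle, \langle c,d \rangle) = \sum_{\bar{S} \in \mathcal{TRAIN}} \; \sum_{k,l = 1 \ldots |\bar{S}|,\; l \ne k} h\big(\bar{S}[k] = \langle a,b \rangle,\ \bar{S}[l] = \langle c,d \rangle\big) \tag{5}$$

where $h(x)$ is an indicator function which is 1 if $x$ is true, 0 if
$x$ is false.

• $C(R, \langle a,b \rangle, \langle c,d \rangle)$ is the number of
times $\langle a,b \rangle$ and $\langle c,d \rangle$ are seen in the
same reduced sentence in training data, and $\langle a,b \rangle$
modifies $\langle c,d \rangle$ with relationship $R$. Formally,

$$C(R, \langle a,b \rangle, \langle c,d \rangle) = \sum_{\bar{S} \in \mathcal{TRAIN}} \; \sum_{k,l = 1 \ldots |\bar{S}|,\; l \ne k} h\big(\bar{S}[k] = \langle a,b \rangle,\ \bar{S}[l] = \langle c,d \rangle,\ AF(k) = (l, R)\big) \tag{6}$$

• $F(R \mid \langle a,b \rangle, \langle c,d \rangle)$ is the
probability that $\langle a,b \rangle$ modifies $\langle c,d \rangle$
with relationship $R$, given that $\langle a,b \rangle$ and
$\langle c,d \rangle$ appear in the same reduced sentence. The
maximum-likelihood estimate of
$F(R \mid \langle a,b \rangle, \langle c,d \rangle)$ is:

$$\hat{F}(R \mid \langle a,b \rangle, \langle c,d \rangle) = \frac{C(R, \langle a,b \rangle, \langle c,d \rangle)}{C(\langle a,b \rangle, \langle c,d \rangle)} \tag{7}$$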
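
The counts in (5) and (6) and the estimate in (7) can be illustrated with a short sketch. The data layout is an assumption made for this example: each training item is a reduced sentence paired with a list `af`, where `af[k]` is `(l, R)` using 0-based positions (the paper indexes from 1), or `None` for the head of the sentence. The function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def count_pairs(train):
    """Return (C, C_R): co-occurrence counts as in (5) and (6)."""
    C, C_R = Counter(), Counter()
    for sent, af in train:
        # All ordered pairs of distinct positions (k, l) in the reduced sentence.
        for k, l in permutations(range(len(sent)), 2):
            C[(sent[k], sent[l])] += 1
            if af[k] is not None and af[k][0] == l:
                C_R[(af[k][1], sent[k], sent[l])] += 1
    return C, C_R

def f_hat(R, x, y, C, C_R):
    """Maximum-likelihood estimate (7); None when the pair was never seen."""
    denom = C[(x, y)]
    return C_R[(R, x, y)] / denom if denom else None

# Toy sentence with a repeated token: <a,b>, <c,d>, <c,d>.
ab, cd = ("a", "b"), ("c", "d")
R = ("NP", "S", "VP")
train = [([ab, cd, cd], [(1, R), None, (1, ("NP", "VP", "VBD"))])]
C, C_R = count_pairs(train)
estimate = f_hat(R, ab, cd, C, C_R)
```

Because every ordered pair of positions is counted, a token that occurs twice contributes two co-occurrences, so here `C[(ab, cd)]` is 2 and the estimate for `R` is 1/2.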

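We can now make the following approximation:

$$P(AF(j) = (h_j, R_j) \mid S, B) \approx \frac{\hat{F}(R_j \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle)}{\sum_{k = 1 \ldots m,\ k \ne j,\ p \in \mathcal{P}} \hat{F}(p \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_k, \bar{t}_k \rangle)} \tag{8}$$

where $\mathcal{P}$ is the set of all triples of non-terminals. The
denominator is a normalising factor which ensures that

$$\sum_{k = 1 \ldots m,\ k \ne j,\ p \in \mathcal{P}} P(AF(j) = (k, p) \mid S, B) = 1$$

From (4) and (8):

$$P(D \mid S, B) \approx \prod_{j=1}^{m} \frac{\hat{F}(R_j \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle)}{\sum_{k = 1 \ldots m,\ k \ne j,\ p \in \mathcal{P}} \hat{F}(p \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_k, \bar{t}_k \rangle)} \tag{9}$$

The denominator of (9) is constant, so maximising $P(D \mid S, B)$ over
$D$ for fixed $S$, $B$ is equivalent to maximising the product of the
numerators, $\mathcal{N}(D \mid S, B)$. (This considerably simplifies
the parsing process):

$$\mathcal{N}(D \mid S, B) = \prod_{j=1}^{m} \hat{F}(R_j \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle) \tag{10}$$

6 Note that we count multiple co-occurrences in a single sentence,
e.g. if $\bar{S} = (\langle a,b \rangle, \langle c,d \rangle, \langle c,d \rangle)$
then $C(\langle a,b \rangle, \langle c,d \rangle) = C(\langle c,d \rangle, \langle a,b \rangle) = 2$.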

2.3 The Distance Measure 

An estimate based on the identities of the two tokens alone is
problematic. Additional context, in particular the relative order of
the two words and the distance between them, will also strongly
influence the likelihood of one word modifying the other. For example
consider the relationship between 'sales' and the three tokens of 'of':

Example 2 Shaw, based in Dalton, Ga., has annual sales of about $ 1.18
billion, and has economies of scale and lower raw-material costs that
are expected to boost the profitability of Armstrong 's brands, sold
under the Armstrong and Evans-Black names .

In this sentence 'sales' and 'of' co-occur three times. The parse tree
in training data indicates a relationship in only one of these cases,
so this sentence would contribute an estimate of 1/3 that the two words
are related. This seems unreasonably low given that 'sales of' is a
strong collocation. The latter two instances of 'of' are so distant
from 'sales' that it is unlikely that there will be a dependency.

This suggests that distance is a crucial variable when deciding whether
two words are related. It is included in the model by defining an extra
'distance' variable, $\Delta$, and extending $C$, $F$ and $\hat{F}$ to
include this variable. For example,
$C(\langle a,b \rangle, \langle c,d \rangle, \Delta)$ is the number of
times $\langle a,b \rangle$ and $\langle c,d \rangle$ appear in the
same sentence at a distance $\Delta$ apart. (11) is then maximised
instead of (10):

$$\mathcal{N}(D \mid S, B) = \prod_{j=1}^{m} \hat{F}(R_j \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) \tag{11}$$

A simple example of $\Delta_{j,h_j}$ would be
$\Delta_{j,h_j} = h_j - j$. However, other features of a sentence, such
as punctuation, are also useful when deciding if two words are related.
We have developed a heuristic 'distance' measure which takes several
such features into account. The current distance measure
$\Delta_{j,h_j}$ is the combination of 6 features, or questions (we
motivate the choice of these questions qualitatively - Section 4 gives
quantitative results showing their merit):

Question 1 Does the $h_j$th word precede or follow the $j$th word?
English is a language with strong word order, so the order of the two
words in surface text will clearly affect their dependency statistics.

Question 2 Are the $h_j$th word and the $j$th word adjacent? English
is largely right-branching and head-initial, which leads to a large
proportion of dependencies being between adjacent words⁷. Table 1
shows just how local most dependencies are.

Distance      1      <= 2   <= 5   <= 10
Percentage    74.2   86.3   95.6   99.0

Table 1: Percentage of dependencies vs. distance between the head words
involved. These figures count baseNPs as a single word, and are taken
from WSJ training data.

Number of verbs   0      <= 1   <= 2
Percentage        94.1   98.1   99.3

Table 2: Percentage of dependencies vs. number of verbs between the
head words involved.

Question 3 Is there a verb between the $h_j$th word and the $j$th
word? Conditioning on the exact distance between two words by making
$\Delta_{j,h_j} = h_j - j$ leads to severe sparse data problems. But
Table 1 shows the need to make finer distance distinctions than just
whether two words are adjacent. Consider the prepositions 'to', 'in'
and 'of' in the following sentence:

Example 3 Oil stocks escaped the brunt of Friday 's selling and several
were able to post gains , including Chevron , which rose 5/8 to 66 3/8
in Big Board composite trading of 2.4 million shares .

The prepositions' main candidates for attachment would appear to be the
previous verb, 'rose', and the baseNP heads between each preposition
and this verb. They are less likely to modify a more distant verb such
as 'escaped'. Question 3 allows the parser to prefer modification of
the most recent verb - effectively another, weaker preference for
right-branching structures. Table 2 shows that 94% of dependencies do
not cross a verb, giving empirical evidence that question 3 is useful.

7 For example in '(John (likes (to (go (to (University
(of Pennsylvania)))))))' all dependencies are between
adjacent words.



Questions 4, 5 and 6

• Are there 0, 1, 2, or more than 2 'commas' between the $h_j$th word
and the $j$th word? (All symbols tagged as a ',' or ':' are considered
to be 'commas').

• Is there a 'comma' immediately following the first of the $h_j$th
word and the $j$th word?

• Is there a 'comma' immediately preceding the second of the $h_j$th
word and the $j$th word?

People find that punctuation is extremely useful for identifying phrase
structure, and the parser described here also relies on it heavily.
Commas are not considered to be words or modifiers in the dependency
model - but they do give strong indications about the parse structure.
Questions 4, 5 and 6 allow the parser to use this information.
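
The six questions can be combined into a single feature tuple, as in the sketch below. It is an illustration under stated assumptions rather than the original code: the input is taken to be a sentence in which baseNPs have already been reduced to their heads but commas are still present, two words separated only by commas are treated as adjacent, and the verb tag set `VERB_TAGS` is a guess at which Penn Treebank tags count as verbs.

```python
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}   # assumed verb tags
COMMA_TAGS = {",", ":"}                                  # 'commas' as defined above

def distance_features(tokens, j, h):
    """tokens: list of (word, tag); j, h: positions of modifier and head."""
    first, second = min(j, h), max(j, h)
    between = tokens[first + 1:second]
    commas = sum(1 for _, tag in between if tag in COMMA_TAGS)
    q1 = h < j                                           # head precedes the modifier?
    q2 = all(tag in COMMA_TAGS for _, tag in between)    # adjacent (commas ignored)?
    q3 = any(tag in VERB_TAGS for _, tag in between)     # a verb in between?
    q4 = min(commas, 3)                                  # 0, 1, 2, or 3 for "more than 2"
    q5 = tokens[first + 1][1] in COMMA_TAGS              # comma right after the first word?
    q6 = tokens[second - 1][1] in COMMA_TAGS             # comma right before the second word?
    return (q1, q2, q3, q4, q5, q6)

tokens = [("Smith", "NNP"), (",", ","), ("president", "NN"), ("of", "IN"),
          ("IBM", "NNP"), (",", ","), ("announced", "VBD"),
          ("resignation", "NN"), ("yesterday", "NN")]
smith_to_announced = distance_features(tokens, 0, 6)
yesterday_to_announced = distance_features(tokens, 8, 6)
```

Using the tuple as a dictionary key alongside the word/tag pairs is one straightforward way to extend the counts C with the distance variable.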

2.4 Sparse Data 

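The maximum likelihood estimator in (7) is likely to be plagued by
sparse data problems -
$C(\langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle, \Delta_{j,h_j})$
may be too low to give a reliable estimate, or worse still it may be
zero leaving the estimate undefined. (Collins 95) describes how a
backed-off estimation strategy is used for making prepositional phrase
attachment decisions. The idea is to back-off to estimates based on
less context. In this case, less context means looking at the POS tags
rather than the specific words.

There are four estimates, $E_1$, $E_2$, $E_3$ and $E_4$, based
respectively on: 1) both words and both tags; 2) $\bar{w}_j$ and the
two POS tags; 3) $\bar{w}_{h_j}$ and the two POS tags; 4) the two POS
tags alone.

$$E_1 = \frac{\eta_1}{\delta_1} \qquad E_2 = \frac{\eta_2}{\delta_2} \qquad E_3 = \frac{\eta_3}{\delta_3} \qquad E_4 = \frac{\eta_4}{\delta_4} \tag{12}$$

where⁸

$$\begin{aligned}
\delta_1 &= C(\langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) \\
\delta_2 &= C(\langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) \\
\delta_3 &= C(\langle \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) \\
\delta_4 &= C(\langle \bar{t}_j \rangle, \langle \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) \\
\eta_1 &= C(R_j, \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) \\
\eta_2 &= C(R_j, \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) \\
\eta_3 &= C(R_j, \langle \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) \\
\eta_4 &= C(R_j, \langle \bar{t}_j \rangle, \langle \bar{t}_{h_j} \rangle, \Delta_{j,h_j})
\end{aligned} \tag{13}$$

8 $C(\langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) = \sum_{x \in \mathcal{V}} C(\langle \bar{w}_j, \bar{t}_j \rangle, \langle x, \bar{t}_{h_j} \rangle, \Delta_{j,h_j})$ and
$C(\langle \bar{t}_j \rangle, \langle \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) = \sum_{x \in \mathcal{V}} \sum_{y \in \mathcal{V}} C(\langle x, \bar{t}_j \rangle, \langle y, \bar{t}_{h_j} \rangle, \Delta_{j,h_j})$,
where $\mathcal{V}$ is the set of all words seen in training data: the
other definitions of C follow similarly.

Estimates 2 and 3 compete - for a given pair of words in test data both
estimates may exist and they are equally 'specific' to the test case
example. (Collins 95) suggests the following way of combining them,
which favours the estimate appearing more often in training data:

$$E_{23} = \frac{\eta_2 + \eta_3}{\delta_2 + \delta_3} \tag{14}$$

This gives three estimates: $E_1$, $E_{23}$ and $E_4$, a similar
situation to trigram language modeling for speech recognition (Jelinek
90), where there are trigram, bigram and unigram estimates. (Jelinek
90) describes a deleted interpolation method which combines these
estimates to give a 'smooth' estimate, and the model uses a variation
of this idea:

If $E_1$ exists, i.e. $\delta_1 > 0$:

$$\hat{F}(R_j \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) = \lambda_1 \times E_1 + (1 - \lambda_1) \times E_{23} \tag{15}$$

Else if $E_{23}$ exists, i.e. $\delta_2 + \delta_3 > 0$:

$$\hat{F}(R_j \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) = \lambda_2 \times E_{23} + (1 - \lambda_2) \times E_4 \tag{16}$$

Else:

$$\hat{F}(R_j \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) = E_4 \tag{17}$$

(Jelinek 90) describes how to find $\lambda$ values in (15) and (16)
which maximise the likelihood of held-out data. We have taken a simpler
approach, namely:

$$\lambda_1 = \frac{\delta_1}{\delta_1 + 1} \qquad \lambda_2 = \frac{\delta_2 + \delta_3}{\delta_2 + \delta_3 + 1} \tag{18}$$

These $\lambda$ values have the desired property of increasing as the
denominator of the more 'specific' estimator increases. We think that a
proper implementation of deleted interpolation is likely to improve
results, although basing estimates on co-occurrence counts alone has
the advantage of reduced training times.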
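
The backed-off estimate of (12) to (18) reduces to a few lines once the eight counts are available. The sketch below is illustrative only; the function name and the convention of passing the counts as two 4-element sequences are assumptions, and returning `None` when even the tag-only count is zero is a choice made here for a case this section does not discuss.

```python
def backed_off_estimate(eta, delta):
    """eta, delta: sequences (eta_1..eta_4) and (delta_1..delta_4) from (13).

    Returns the smoothed estimate defined by (15)-(17), or None when
    delta_4 is zero (no estimate exists at any level).
    """
    e1 = eta[0] / delta[0] if delta[0] > 0 else None
    d23 = delta[1] + delta[2]
    e23 = (eta[1] + eta[2]) / d23 if d23 > 0 else None       # equation (14)
    e4 = eta[3] / delta[3] if delta[3] > 0 else None
    if e1 is not None:
        lam1 = delta[0] / (delta[0] + 1)                      # equation (18)
        return lam1 * e1 + (1 - lam1) * e23                   # equation (15)
    if e23 is not None:
        lam2 = d23 / (d23 + 1)                                # equation (18)
        return lam2 * e23 + (1 - lam2) * e4                   # equation (16)
    return e4                                                 # equation (17)

# Word pair seen once with this relationship; tag-level evidence is weaker.
seen_once = backed_off_estimate(eta=(1, 3, 2, 30), delta=(1, 10, 10, 200))
# Word pair never seen; only the tag-level estimate is available.
tags_only = backed_off_estimate(eta=(0, 0, 0, 30), delta=(0, 0, 0, 200))
```

Note that $\delta_1 > 0$ implies $\delta_2 + \delta_3 > 0$, since the less specific counts sum over the more specific ones, so $E_{23}$ is always defined when $E_1$ is. In the first call $\lambda_1 = 1/2$, so the single lexical observation is weighted equally with $E_{23}$.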

2.5 The BaseNP Model 

The overall model would be simpler if we could do without the baseNP
model and frame everything in terms of dependencies. However the baseNP
model is needed for two reasons. First, while adjacency between words
is a good indicator of whether there is some relationship between them,
this indicator is made substantially stronger if baseNPs are reduced to
a single word. Second, it means that words internal to baseNPs are not
included in the co-occurrence counts in training data. Otherwise, in a
phrase like 'The Securities and Exchange Commission closed yesterday',
pre-modifying nouns like 'Securities' and 'Exchange' would be included
in co-occurrence counts, when in practice there is no way that they can
modify words outside their baseNP.

The baseNP model can be viewed as tagging the gaps between words with
S(tart), C(ontinue), E(nd), B(etween) or N(ull) symbols, respectively
meaning that the gap is at the start of a baseNP, continues a baseNP,
is at the end of a baseNP, is between two adjacent baseNPs, or is
between two words which are both not in baseNPs. We call the gap before
the $i$th word $G_i$ (a sentence with $n$ words has $n - 1$ gaps). For
example,

[ John Smith ] [ the president ] of [ IBM ] has announced [ his resignation ] [ yesterday ] =>
John C Smith B the C president E of S IBM E has N announced S his C resignation B yesterday

The baseNP model considers the words directly to the left and right of
each gap, and whether there is a comma between the two words (we write
$c_i = 1$ if there is a comma, $c_i = 0$ otherwise). Probability
estimates are based on counts of consecutive pairs of words in
unreduced training data sentences, where baseNP boundaries define
whether gaps fall into the S, C, E, B or N categories. The probability
of a baseNP sequence in an unreduced sentence $S$ is then:

$$\prod_{i = 2 \ldots n} P(G_i \mid w_{i-1}, t_{i-1}, w_i, t_i, c_i) \tag{19}$$

The estimation method is analogous to that described in the sparse data
section of this paper. The method is similar to that described in
(Ramshaw and Marcus 95; Church 88), where baseNP detection is also
framed as a tagging problem.
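
The gap-tagging view of baseNPs is easy to make concrete. The following sketch derives the S/C/E/B/N tags from a bracketing; representing baseNPs as half-open word spans and the function name `gap_tags` are assumptions of this example, not part of the original system.

```python
def gap_tags(n_words, base_nps):
    """base_nps: non-overlapping (start, end) word spans, end exclusive.

    Returns the tags of the n_words - 1 gaps; element i is the tag of the
    gap between word i and word i + 1 (0-based positions).
    """
    np_of = {}                                  # word position -> index of its baseNP
    for idx, (start, end) in enumerate(base_nps):
        for w in range(start, end):
            np_of[w] = idx
    tags = []
    for i in range(n_words - 1):
        left, right = np_of.get(i), np_of.get(i + 1)
        if left is None and right is None:
            tags.append("N")                    # neither word is in a baseNP
        elif left is None:
            tags.append("S")                    # a baseNP starts after the gap
        elif right is None:
            tags.append("E")                    # a baseNP ends before the gap
        elif left == right:
            tags.append("C")                    # both words in the same baseNP
        else:
            tags.append("B")                    # gap between two adjacent baseNPs
    return tags

words = "John Smith the president of IBM has announced his resignation yesterday".split()
spans = [(0, 2), (2, 4), (5, 6), (8, 10), (10, 11)]
tags = gap_tags(len(words), spans)
```

For the example sentence above this reproduces the sequence C B C E S E N S C B.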

2.6 Summary of the Model 

The probability of a parse tree $T$, given a sentence $S$, is:

$$P(T \mid S) = P(B, D \mid S) = P(B \mid S) \times P(D \mid S, B)$$

The denominator in Equation (9) is not actually constant for different
baseNP sequences, but we make this approximation for the sake of
efficiency and simplicity. In practice this is a good approximation
because most baseNP boundaries are very well defined, so parses which
have high enough $P(B \mid S)$ to be among the highest scoring parses
for a sentence tend to have identical or very similar baseNPs. Parses
are ranked by the following quantity⁹:

$$P(B \mid S) \times \mathcal{N}(D \mid S, B) \tag{20}$$

Equations (19) and (11) define $P(B \mid S)$ and
$\mathcal{N}(D \mid S, B)$. The parser finds the tree which maximises
(20) subject to the hard constraint that dependencies cannot cross.

9 In fact we also model the set of unary productions, $U$, in the
tree, which are of the form $P \rightarrow \langle C_1 \rangle$. This
introduces an additional term, $P(U \mid B, S)$, into (20).



2.7 Some Further Improvements to the 
Model 

This section describes two modifications which improve the model's
performance.

• In addition to conditioning on whether dependencies cross commas, a
single constraint concerning punctuation is introduced. If for any
constituent Z in the chart Z -> <.. X Y ..> two of its children X and Y
are separated by a comma, then the last word in Y must be directly
followed by a comma, or must be the last word in the sentence. In
training data 96% of commas follow this rule. The rule also has the
benefit of improving efficiency by reducing the number of constituents
in the chart.

• The model we have described thus far takes the single best sequence
of tags from the tagger, and it is clear that there is potential for
better integration of the tagger and parser. We have tried two
modifications. First, the current estimation methods treat occurrences
of the same word with different POS tags as effectively distinct types.
Tags can be ignored when lexical information is available by defining

$$C(a, c) = \sum_{b, d \in \mathcal{T}} C(\langle a,b \rangle, \langle c,d \rangle) \tag{21}$$

where $\mathcal{T}$ is the set of all tags. Hence $C(a, c)$ is the
number of times that the words $a$ and $c$ occur in the same sentence,
ignoring their tags. The other definitions in (13) are similarly
redefined, with POS tags only being used when backing off from lexical
information. This makes the parser less sensitive to tagging errors.

Second, for each word $w_i$ the tagger can provide the distribution of
tag probabilities $P(t_i \mid S)$ (given the previous two words are
tagged as in the best overall sequence of tags) rather than just the
first best tag. The score for a parse in equation (20) then has an
additional term, $\prod_{i=1}^{n} P(t_i \mid S)$, the product of
probabilities of the tags which it contains.

Ideally we would like to integrate POS tagging 
into the parsing model rather than treating it as a 
separate stage. This is an area for future research. 
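
The punctuation constraint from the first modification above can be expressed as a simple check on two adjacent children. This is a sketch under assumptions: positions index the full (unreduced) sentence, 'commas' are the tags ',' and ':', and sentence-final punctuation is not given any special treatment, which the paper does not specify. The function name is hypothetical.

```python
COMMA_TAGS = {",", ":"}

def satisfies_comma_rule(tags, x_end, y_start, y_end):
    """tags: POS tags of the whole sentence.

    x_end: position of the last word of child X; y_start, y_end: positions
    of the first and last word of the following child Y. Returns False only
    when a comma separates X and Y but Y is neither followed by a comma nor
    sentence-final.
    """
    separated = any(tags[i] in COMMA_TAGS for i in range(x_end + 1, y_start))
    if not separated:
        return True                       # the rule does not apply
    if y_end == len(tags) - 1:
        return True                       # Y ends the sentence
    return tags[y_end + 1] in COMMA_TAGS  # Y must be closed by a comma

# John Smith , the president of IBM , announced his resignation yesterday .
tags = ["NNP", "NNP", ",", "DT", "NN", "IN", "NNP", ",",
        "VBD", "PRP$", "NN", "NN", "."]
full_appositive = satisfies_comma_rule(tags, x_end=1, y_start=3, y_end=6)
partial_appositive = satisfies_comma_rule(tags, x_end=1, y_start=3, y_end=4)
```

Here attaching the whole of 'the president of IBM' to 'John Smith' is allowed, while attaching only 'the president' is rejected, so the corresponding constituent would never enter the chart.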

3 The Parsing Algorithm 

The parsing algorithm is a simple bottom-up chart parser. There is no
grammar as such, although in practice any dependency with a triple of
non-terminals which has not been seen in training data will get zero
probability. Thus the parser searches through the space of all trees
with non-terminal triples seen in training data. Probabilities of
baseNPs in the chart are calculated using (19), while probabilities for
other constituents are derived from the dependencies and baseNPs that
they contain. A dynamic programming algorithm is used: if two proposed
constituents span the same set of words, have the same label, head, and
distance from the head to the left and right end of the constituent,
then the lower probability constituent can be safely discarded. Figure
4 shows how constituents in the chart combine in a bottom-up manner.

          <= 40 words (2245 sentences)              <= 100 words (2416 sentences)
Model     LR     LP     CBs   0 CBs  <= 2 CBs       LR     LP     CBs   0 CBs  <= 2 CBs
(1)       84.9%  84.9%  1.32  57.2%  80.8%          84.3%  84.3%  1.53  54.7%  77.8%
(2)       85.4%  85.5%  1.21  58.4%  82.4%          84.8%  84.8%  1.41  55.9%  79.4%
(3)       85.5%  85.7%  1.19  59.5%  82.6%          85.0%  85.1%  1.39  56.8%  79.6%
(4)       85.8%  86.3%  1.14  59.9%  83.6%          85.3%  85.7%  1.32  57.2%  80.8%
SPATTER   84.6%  84.9%  1.26  56.6%  81.4%          84.0%  84.3%  1.46  54.0%  78.8%

Table 3: Results on Section 23 of the WSJ Treebank. (1) is the basic
model; (2) is the basic model with the punctuation rule described in
section 2.7; (3) is model (2) with POS tags ignored when lexical
information is present; (4) is model (3) with probability distributions
from the POS tagger. LR/LP = labeled recall/precision. CBs is the
average number of crossing brackets per sentence. 0 CBs, <= 2 CBs are
the percentage of sentences with 0 or <= 2 crossing brackets
respectively.

[Figure 4 diagram: the constituents VBD 'announced' (Score = S1) and NP
'his resignation' (Score = S2) join to form a VP with
Score = S1 * S2 * P(Gap=S | announced, his) * P(<NP,VP,VBD> | resignation, announced)]

Figure 4: Diagram showing how two constituents join to form a new
constituent. Each operation gives two new probability terms: one for
the baseNP gap tag between the two constituents, and the other for the
dependency between the head words of the two constituents.
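
The combination step of Figure 4 and the dynamic programming condition can be sketched as follows. This is an illustrative fragment, not the original parser: the `Constituent` record, the use of log probabilities, and the function names are assumptions, and the two probability terms are passed in rather than looked up in the model.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Constituent:
    start: int      # position of the first word covered
    end: int        # position of the last word covered
    label: str
    head: int       # position of the head word
    logp: float     # log probability of the constituent

def combine(left, right, label, head_is_left, log_p_gap, log_p_dep):
    """Join two adjacent constituents: S1 * S2 * P(gap) * P(dependency), in log space."""
    head = left.head if head_is_left else right.head
    return Constituent(left.start, right.end, label, head,
                       left.logp + right.logp + log_p_gap + log_p_dep)

def add_to_chart(chart, c):
    """Keep only the most probable constituent for each signature.

    With start, end and head fixed, the distances from the head to both
    ends of the constituent are fixed too, so they need no separate field.
    """
    key = (c.start, c.end, c.label, c.head)
    if key not in chart or chart[key].logp < c.logp:
        chart[key] = c

vbd = Constituent(0, 0, "VBD", 0, math.log(0.5))   # announced
np_ = Constituent(1, 2, "NP", 2, math.log(0.4))    # his resignation
vp = combine(vbd, np_, "VP", True, math.log(0.9), math.log(0.2))

chart = {}
add_to_chart(chart, vp)
add_to_chart(chart, Constituent(0, 2, "VP", 0, math.log(0.01)))  # weaker duplicate
```

The weaker duplicate shares the signature of `vp` and is therefore discarded, which is the pruning the dynamic programming condition licenses.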

4 Results 

The parser was trained on sections 02 - 21 of the Wall Street Journal
portion of the Penn Treebank (Marcus et al. 93) (approximately 40,000
sentences), and tested on section 23 (2,416 sentences). For comparison
SPATTER (Magerman 95; Jelinek et al. 94) was also tested on section 23.
We use the PARSEVAL measures (Black et al. 91) to compare performance:

$$\text{Labeled Precision} = \frac{\text{number of correct constituents in proposed parse}}{\text{number of constituents in proposed parse}}$$

$$\text{Labeled Recall} = \frac{\text{number of correct constituents in proposed parse}}{\text{number of constituents in treebank parse}}$$

Crossing Brackets = number of constituents which violate constituent
boundaries with a constituent in the treebank parse.
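
As a worked illustration of these measures, the sketch below scores one proposed parse against a treebank parse. It assumes constituents are given as sets of `(label, start, end)` triples with `end` exclusive and punctuation already removed; using sets means repeated identical constituents are counted once, which is a simplification of this example.

```python
def parseval(proposed, gold):
    """Return (labeled precision, labeled recall, crossing brackets)."""
    correct = len(proposed & gold)
    precision = correct / len(proposed)
    recall = correct / len(gold)

    def crosses(a, b):
        # Two spans cross when they overlap but neither contains the other.
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

    crossing = sum(
        1 for (_, s, e) in proposed
        if any(crosses((s, e), (gs, ge)) for (_, gs, ge) in gold)
    )
    return precision, recall, crossing

gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
proposed = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 2, 4)}
precision, recall, crossing = parseval(proposed, gold)
```

Three of the four proposed constituents are correct, so precision and recall are both 0.75, and the proposed NP over words 2 to 3 crosses the treebank NP over words 3 to 4, giving one crossing bracket.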
For a constituent to be 'correct' it must span the same set of words
(ignoring punctuation, i.e. all tokens tagged as commas, colons or
quotes) and have the same label¹⁰ as a constituent in the treebank
parse. Four configurations of the parser were tested: (1) The basic
model; (2) The basic model with the punctuation rule described in
section 2.7; (3) Model (2) with tags ignored when lexical information
is present, as described in 2.7; and (4) Model (3) also using the full
probability distributions for POS tags. We should emphasise that test
data outside of section 23 was used for all development of the model,
avoiding the danger of implicit training on section 23. Table 3 shows
the results of the tests. Table 4 shows results which indicate how
different parts of the system contribute to performance.

Distance   Lexical
Measure    Information   LR      LP      CBs
Yes        Yes           85.0%   85.1%   1.39
Yes        No            76.1%   76.6%   2.26
No         Yes           80.9%   83.6%   1.51

Table 4: The contribution of various components of the model. The
results are for all sentences of <= 100 words in section 23 using model
(3). For 'no lexical information' all estimates are based on POS tags
alone. For 'no distance measure' the distance measure is Question 1
alone (i.e. whether $\bar{w}_j$ precedes or follows $\bar{w}_{h_j}$).

10 SPATTER collapses ADVP and PRT to the same label, for comparison we
also removed this distinction when calculating scores.

4.1 Performance Issues

All tests were made on a Sun SPARCServer 1000E, using 100% of a 60MHz
SuperSPARC processor. The parser uses around 180 megabytes of memory,
and training on 40,000 sentences (essentially extracting the
co-occurrence counts from the corpus) takes under 15 minutes. Loading
the hash table of bigram counts into memory takes approximately 8
minutes.

Two strategies are employed to improve parsing efficiency. First, a
constant probability threshold is used while building the chart - any
constituents with lower probability than this threshold are discarded.
If a parse is found, it must be the highest ranked parse by the model
(as all constituents discarded have lower probabilities than this parse
and could not, therefore, be part of a higher probability parse). If no
parse is found, the threshold is lowered and parsing is attempted
again. The process continues until a parse is found.

Second, a beam search strategy is used. For each span of words in the
sentence the probability, $P_h$, of the highest probability constituent
is recorded. All other constituents spanning the same words must have
probability greater than $P_h / \beta$ for some constant beam size
$\beta$ - constituents which fall out of this beam are discarded. The
method risks introducing search-errors, but in practice efficiency can
be greatly improved with virtually no loss of accuracy. Table 5 shows
the trade-off between speed and accuracy as the beam is narrowed.
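
The beam criterion amounts to a one-line filter per span. The sketch below is an illustration, assuming each span's candidates are held as `(probability, item)` pairs and that the beam size is greater than 1; the function name is hypothetical.

```python
def prune_span(constituents, beta):
    """Keep the constituents of one span whose probability exceeds P_h / beta.

    constituents: list of (probability, item) pairs covering the same words.
    beta: beam size, assumed to be greater than 1 so the best item survives.
    """
    if not constituents:
        return []
    p_h = max(p for p, _ in constituents)       # best probability for this span
    return [(p, item) for p, item in constituents if p > p_h / beta]

candidates = [(0.5, "VP"), (0.1, "S"), (0.01, "NP")]
wide = prune_span(candidates, beta=1000)   # threshold 0.0005
medium = prune_span(candidates, beta=20)   # threshold 0.025
narrow = prune_span(candidates, beta=3)    # threshold about 0.167
```

Narrowing the beam from 1000 to 3 reduces the surviving candidates from three to one in this toy span, which is the source of both the speed-up and the risk of search errors.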



Beam size β   Speed (sentences/minute)   LR      LP      CBs
1000          118                        84.9%   85.1%   1.39
150           166                        84.8%   85.1%   1.38
20            217                        84.7%   85.0%   1.40
3             261                        84.1%   84.5%   1.44
1.5           283                        83.7%   84.1%   1.48
1.2           289                        83.5%   83.9%   1.50

Table 5: The trade-off between speed and accuracy as the beam-size is
varied. Model (3) was used for this test on all sentences <= 100 words
in section 23.



5 Conclusions and Future Work 

We have shown that a simple statistical model based on dependencies
between words can parse Wall Street Journal news text with high
accuracy. The method is equally applicable to tree or dependency
representations of syntactic structures.

There are many possibilities for improvement, which is encouraging.
More sophisticated estimation techniques such as deleted interpolation
should be tried. Estimates based on relaxing the distance measure could
also be used for smoothing - at present we only back-off on words. The
distance measure could be extended to capture more context, such as
other words or tags in the sentence. Finally, the model makes no
account of valency.